To build a Decision Tree model that predicts the signs_of_mental_illness of the victim from the Fatal Police Shooting Data, regularize the model, and compare the performance.
About the dataset\ The dataset consists of the following columns (shown via `police.head()` below).
import pandas as pd
import numpy as np
from sklearn import preprocessing
import copy
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report,accuracy_score,precision_score,recall_score
# calculate accuracy measures and confusion matrix
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')
# For Visualizing plots
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
police = pd.read_csv("fatal-police-shootings-data.csv")
police.head()
| | id | name | date | manner_of_death | armed | age | gender | race | city | state | signs_of_mental_illness | threat_level | flee | body_camera |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Tim Elliot | 2015-01-02 | shot | gun | 53.0 | M | A | Shelton | WA | True | attack | Not fleeing | False |
| 1 | 4 | Lewis Lee Lembke | 2015-01-02 | shot | gun | 47.0 | M | W | Aloha | OR | False | attack | Not fleeing | False |
| 2 | 5 | John Paul Quintero | 2015-01-03 | shot and Tasered | unarmed | 23.0 | M | H | Wichita | KS | False | other | Not fleeing | False |
| 3 | 8 | Matthew Hoffman | 2015-01-04 | shot | toy weapon | 32.0 | M | W | San Francisco | CA | True | attack | Not fleeing | False |
| 4 | 9 | Michael Rodriguez | 2015-01-04 | shot | nail gun | 39.0 | M | H | Evans | CO | False | attack | Not fleeing | False |
police.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4478 entries, 0 to 4477
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   id                       4478 non-null   int64
 1   name                     4478 non-null   object
 2   date                     4478 non-null   object
 3   manner_of_death          4478 non-null   object
 4   armed                    4230 non-null   object
 5   age                      4309 non-null   float64
 6   gender                   4473 non-null   object
 7   race                     4105 non-null   object
 8   city                     4478 non-null   object
 9   state                    4478 non-null   object
 10  signs_of_mental_illness  4478 non-null   bool
 11  threat_level             4478 non-null   object
 12  flee                     4299 non-null   object
 13  body_camera              4478 non-null   bool
dtypes: bool(2), float64(1), int64(1), object(10)
memory usage: 428.7+ KB
police.describe()
| | id | age |
|---|---|---|
| count | 4478.000000 | 4309.000000 |
| mean | 2502.721974 | 36.879322 |
| std | 1404.978671 | 13.067598 |
| min | 3.000000 | 6.000000 |
| 25% | 1286.250000 | 27.000000 |
| 50% | 2505.500000 | 35.000000 |
| 75% | 3718.750000 | 45.000000 |
| max | 4927.000000 | 91.000000 |
police.describe(include='object').T
| | count | unique | top | freq |
|---|---|---|---|---|
| name | 4478 | 4332 | TK TK | 129 |
| date | 4478 | 1551 | 2018-06-29 | 9 |
| manner_of_death | 4478 | 2 | shot | 4250 |
| armed | 4230 | 84 | gun | 2489 |
| gender | 4473 | 2 | M | 4265 |
| race | 4105 | 6 | W | 2059 |
| city | 4478 | 2168 | Phoenix | 68 |
| state | 4478 | 51 | CA | 680 |
| threat_level | 4478 | 3 | attack | 2829 |
| flee | 4299 | 4 | Not fleeing | 2868 |
Display the percentage of missing values present in each column of the data, rounded to 2 decimal places.
round(police.isna().sum() / police.shape[0] * 100, 2)
id                         0.00
name                       0.00
date                       0.00
manner_of_death            0.00
armed                      5.54
age                        3.77
gender                     0.11
race                       8.33
city                       0.00
state                      0.00
signs_of_mental_illness    0.00
threat_level               0.00
flee                       4.00
body_camera                0.00
dtype: float64
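The same per-column percentage can be computed a little more directly with `isna().mean()`. A minimal sketch, using a small hypothetical frame in place of the police data:

```python
import pandas as pd

# Toy frame standing in for the police data (hypothetical values)
toy = pd.DataFrame({"armed": ["gun", None, "knife", None],
                    "age": [53.0, 47.0, None, 39.0]})

# isna().mean() gives the fraction of missing cells per column;
# multiplying by 100 and rounding matches the cell above
pct_missing = toy.isna().mean().mul(100).round(2)
print(pct_missing)  # armed 50.0, age 25.0
```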
# So, how many total missing values do we have?
total_cells = np.prod(police.shape)  # total number of cells (np.product is deprecated)
total_missing = police.isnull().sum().sum()
# percent of data that is missing
percent_missing = (total_missing/total_cells) * 100
print(percent_missing)
1.5536272570662923
# Since only about 1.6% of the cells are missing, let's drop the rows containing them.
# Note: this drops whole rows, so the row count falls from 4478 to 3663 (~18%).
# Remove all the rows that contain a missing value
police = police.dropna()
police.isnull().sum()
id                         0
name                       0
date                       0
manner_of_death            0
armed                      0
age                        0
gender                     0
race                       0
city                       0
state                      0
signs_of_mental_illness    0
threat_level               0
flee                       0
body_camera                0
dtype: int64
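Dropping rows is one option; an alternative that keeps all rows is to impute, e.g. the median for numeric columns and the most frequent value for categoricals. A hedged sketch on hypothetical toy data (not the method used in this notebook):

```python
import pandas as pd

df = pd.DataFrame({"age": [53.0, None, 23.0, 32.0],
                   "armed": ["gun", "gun", None, "knife"]})

# Median imputation for the numeric column
df["age"] = df["age"].fillna(df["age"].median())
# Mode (most frequent value) imputation for the categorical column
df["armed"] = df["armed"].fillna(df["armed"].mode()[0])

print(df.isna().sum().sum())  # 0 — no missing values remain
```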
# Drop redundant or not so useful columns for model building
police = police.drop(['id','date','city','name','state'], axis=1)
# We take a copy of our source data.
df = copy.deepcopy(police)
df.head()
| | manner_of_death | armed | age | gender | race | signs_of_mental_illness | threat_level | flee | body_camera |
|---|---|---|---|---|---|---|---|---|---|
| 0 | shot | gun | 53.0 | M | A | True | attack | Not fleeing | False |
| 1 | shot | gun | 47.0 | M | W | False | attack | Not fleeing | False |
| 2 | shot and Tasered | unarmed | 23.0 | M | H | False | other | Not fleeing | False |
| 3 | shot | toy weapon | 32.0 | M | W | True | attack | Not fleeing | False |
| 4 | shot | nail gun | 39.0 | M | H | False | attack | Not fleeing | False |
le = preprocessing.LabelEncoder() # create a LabelEncoder object
# Convert the values of the following attributes to integer codes:
# fit_transform maps each distinct value to an integer (e.g. True/False -> 1/0)
df["body_camera"]= le.fit_transform(df["body_camera"])
df["gender"]= le.fit_transform(df["gender"])
df["signs_of_mental_illness"]= le.fit_transform(df["signs_of_mental_illness"])
df["manner_of_death"]= le.fit_transform(df['manner_of_death'])
df['manner_of_death'] = df['manner_of_death'].astype('int')
# convert the following into int datatype
df['gender'] = df['gender'].astype('int')
df['signs_of_mental_illness'] = df['signs_of_mental_illness'].astype('int')
df['body_camera'] = df['body_camera'].astype('int')
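Worth noting: `fit_transform` already returns an integer array, so the `astype('int')` calls above are belt-and-braces rather than strictly required. A minimal sketch of what the encoder does, on hypothetical gender values:

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

# Classes are sorted alphabetically before being numbered: 'F' -> 0, 'M' -> 1
codes = le.fit_transform(["M", "F", "M", "M"])
print(codes.tolist())      # [1, 0, 1, 1]
print(list(le.classes_))   # ['F', 'M']
```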
df.head()
| | manner_of_death | armed | age | gender | race | signs_of_mental_illness | threat_level | flee | body_camera |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | gun | 53.0 | 1 | A | 1 | attack | Not fleeing | 0 |
| 1 | 0 | gun | 47.0 | 1 | W | 0 | attack | Not fleeing | 0 |
| 2 | 1 | unarmed | 23.0 | 1 | H | 0 | other | Not fleeing | 0 |
| 3 | 0 | toy weapon | 32.0 | 1 | W | 1 | attack | Not fleeing | 0 |
| 4 | 0 | nail gun | 39.0 | 1 | H | 0 | attack | Not fleeing | 0 |
# Decision tree in Python can take only numerical / categorical columns. It cannot take string / object types.
# The following code loops through each column and checks if the column type is object then converts those columns
# into categorical with each distinct value becoming a category or code.
for feature in df.columns: # Loop through all columns in the dataframe
if df[feature].dtype == 'object': # Only apply for columns with categorical strings
df[feature] = pd.Categorical(df[feature]).codes # Replace strings with an integer
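One caveat with `pd.Categorical(...).codes`: it imposes an arbitrary (alphabetical) order on categories, which a tree can tolerate but which one-hot encoding avoids at the cost of extra columns. A hedged sketch on a hypothetical slice of the `flee` column:

```python
import pandas as pd

toy = pd.DataFrame({"flee": ["Not fleeing", "Car", "Foot", "Not fleeing"]})

# Integer codes follow alphabetical category order: Car=0, Foot=1, Not fleeing=2
codes = pd.Categorical(toy["flee"]).codes
# One-hot alternative: one boolean column per distinct value
dummies = pd.get_dummies(toy["flee"], prefix="flee")

print(codes.tolist())   # [2, 0, 1, 2]
print(dummies.shape)    # (4, 3)
```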
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 3663 entries, 0 to 4477
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   manner_of_death          3663 non-null   int64
 1   armed                    3663 non-null   int8
 2   age                      3663 non-null   float64
 3   gender                   3663 non-null   int64
 4   race                     3663 non-null   int8
 5   signs_of_mental_illness  3663 non-null   int64
 6   threat_level             3663 non-null   int8
 7   flee                     3663 non-null   int8
 8   body_camera              3663 non-null   int64
dtypes: float64(1), int64(4), int8(4)
memory usage: 186.0 KB
df.head()
| | manner_of_death | armed | age | gender | race | signs_of_mental_illness | threat_level | flee | body_camera |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 30 | 53.0 | 1 | 0 | 1 | 0 | 2 | 0 |
| 1 | 0 | 30 | 47.0 | 1 | 5 | 0 | 0 | 2 | 0 |
| 2 | 1 | 75 | 23.0 | 1 | 2 | 0 | 1 | 2 | 0 |
| 3 | 0 | 74 | 32.0 | 1 | 5 | 1 | 0 | 2 | 0 |
| 4 | 0 | 53 | 39.0 | 1 | 2 | 0 | 0 | 2 | 0 |
We are going to build a Decision Tree model that predicts signs_of_mental_illness.
Split the independent and dependent attributes separately from the given dataset and save them in the X and y variables respectively.
X = df.drop('signs_of_mental_illness', axis=1) # Independent attributes
y = df['signs_of_mental_illness'] # Dependent attribute
train_char_label = ['No', 'Yes']
Split the data into train and test sets with an 80:20 ratio.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=71)
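Since the target is imbalanced (the later classification report shows roughly a 75/25 class split in the test set), passing `stratify=y` would preserve the class ratio in both splits. A hedged sketch on a synthetic stand-in target, not part of the original notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic 75/25 target standing in for signs_of_mental_illness
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 75 + [1] * 25)

# stratify=y keeps the 75/25 class ratio in both train and test
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=71, stratify=y)
print(int(y_te.sum()))  # 5 — exactly 25% of the 20 test rows are class 1
```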
Build a decision tree model using the appropriate class from the sklearn library and fit it on the train data.
dt_model = DecisionTreeClassifier(criterion = 'entropy')
dt_model.fit(x_train, y_train)
DecisionTreeClassifier(criterion='entropy')
Use the model built above to predict on the test data.
y_predict = dt_model.predict(x_test)
print(dt_model.score(x_train , y_train))
print(dt_model.score(x_test , y_test))
0.8955631399317406
0.6971350613915416
Print the classification report for the test data
print(classification_report(y_test, y_predict)) # display precision, recall and f1-score per class on the test data
precision recall f1-score support
0 0.77 0.85 0.81 548
1 0.35 0.23 0.28 185
accuracy 0.70 733
macro avg 0.56 0.54 0.54 733
weighted avg 0.66 0.70 0.67 733
from IPython.display import Image
#import pydotplus as pydot
from sklearn import tree
from os import system
decision_tree = open('decision_tree.dot','w')
dot_data = tree.export_graphviz(dt_model, out_file= decision_tree , feature_names = list(x_train), class_names = list(train_char_label))
decision_tree.close()
print (pd.DataFrame(dt_model.feature_importances_, columns = ["Imp"], index = x_train.columns))
                      Imp
manner_of_death  0.031041
armed            0.136920
age              0.485999
gender           0.029534
race             0.097741
threat_level     0.062331
flee             0.089008
body_camera      0.067425
# You can also copy the script in the .dot file and paste it at http://webgraphviz.com/ to get tree view
# or create a .png as below
system("dot -Tpng decision_tree.dot -o decision_tree.png")
Image("decision_tree.png")
We can see that the tree is overgrown: train accuracy (~0.90) is far higher than test accuracy (~0.70), a clear sign of overfitting.
Build a pruned decision tree classifier model with criterion=entropy, max_depth=6, min_samples_split=3, min_samples_leaf=1 and fit it on train data.
reg_dt_model = DecisionTreeClassifier(criterion='entropy', max_depth=6, min_samples_split=3, min_samples_leaf=1)
reg_dt_model = reg_dt_model.fit(x_train, y_train)
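The hyperparameter values above are given by the exercise; in practice they can be tuned with GridSearchCV (imported at the top of the notebook but otherwise unused). A hedged sketch on synthetic stand-in data, not the notebook's own tuning:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the police features/target
X, y = make_classification(n_samples=500, n_features=8, random_state=71)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=71)

# Cross-validated search over depth and leaf-size settings
param_grid = {"max_depth": [3, 6, 9],
              "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(criterion="entropy", random_state=71),
                      param_grid, cv=5)
search.fit(x_tr, y_tr)
print(search.best_params_)
```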
credit_tree_regularized = open('credit_tree_regularized.dot','w')
dot_data = tree.export_graphviz(reg_dt_model, out_file= credit_tree_regularized , feature_names = list(x_train), class_names = list(train_char_label))
credit_tree_regularized.close()
Display the feature importances of all the predictors from the regularized (pruned) decision tree model.
# Print the feature importances of all the independent variables (predictors)
# to see which attributes contributed most to the prediction.
print(pd.DataFrame(reg_dt_model.feature_importances_, columns = ["Imp"], index = x_train.columns))
                      Imp
manner_of_death  0.000000
armed            0.207914
age              0.205158
gender           0.009165
race             0.158310
threat_level     0.043464
flee             0.365008
body_camera      0.010980
# You can also copy the script in the .dot file and paste it at http://webgraphviz.com/ to get tree view
# or create a .png as below
system("dot -Tpng credit_tree_regularized.dot -o credit_tree_regularized.png")
Image("credit_tree_regularized.png")
y_predict = reg_dt_model.predict(x_test)
Check the accuracy of the regularized decision tree model on both the train and test data.
print(reg_dt_model.score(x_train, y_train)) # score() returns the mean accuracy on the train data
print(reg_dt_model.score(x_test, y_test))   # mean accuracy on the test data
0.7563139931740614
0.7489768076398363
print(metrics.confusion_matrix(y_test, y_predict))
[[547   1]
 [183   2]]
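The matrix shows that the pruned model's higher accuracy is deceptive: of 185 true positives it catches only 2. Deriving the per-class metrics by hand from the matrix printed above:

```python
import numpy as np

# Confusion matrix from the cell above: rows = true class, cols = predicted class
cm = np.array([[547, 1],
               [183, 2]])

accuracy = cm.trace() / cm.sum()          # (547 + 2) / 733
recall_1 = cm[1, 1] / cm[1].sum()         # recall for the positive class: 2 / 185
precision_1 = cm[1, 1] / cm[:, 1].sum()   # precision for the positive class: 2 / 3

print(round(accuracy, 3))   # 0.749 — matches the test accuracy above
print(round(recall_1, 4))   # 0.0108 — almost no positives are caught
```

So the regularized tree mostly predicts the majority class; a class-weighted tree or a stratified split may be worth exploring next.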